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ABSTRACT 

Revealing the clonal composition of a single tumor 
is essential for identifying cell subpopulations with 
metastatic potential in primary tumors or with 
resistance to therapies in metastatic tumors. 
Sequencing technologies provide only an overview 
of the aggregate of numerous cells. Computational 
approaches to de-mix a collective signal composed 
of the aberrations of a mixed cell population of 
a tumor sample into its individual components are 
not available. We propose an evolutionary frame- 
work for deconvolving data from a single genome- 
wide experiment to infer the composition, 
abundance and evolutionary paths of the underlying 
cell subpopulations of a tumor. We have developed 
an algorithm (TrAp) for solving this mixture 
problem. In silico analyses show that TrAp correctly 
deconvolves mixed subpopulations when the 
number of subpopulations and the measurement 
errors are moderate. We demonstrate the applic- 
ability of the method using tumor karyotypes and 
somatic hypermutation data sets. We applied TrAp 
to Exome-Seq experiment of a renal cell carcinoma 
tumor sample and compared the mutational profile 
of the inferred subpopulations to the mutational 
profiles of single cells of the same tumor. Finally, 
we deconvolve sequencing data from eight acute 
myeloid leukemia patients and three distinct 
metastases of one melanoma patient to exhibit the 
evolutionary relationships of their subpopulations. 

INTRODUCTION 

The mechanisms of cancer evolution and metastatic onset 
are still largely unknown. The diversity, complexity and 
evasive nature of tumor biology are central reasons for the 
seemingly slow progress in the cure of most cancer types, 



particularly in controlling the ability of tumor populations 
to spread. Tumor populations are dynamic aggregates of 
constantly evolving subclones, each carrying a variety of 
genomic aberrations (1-35). This genetic heterogeneity is 
often associated with differences in the biological behavior 
of different cell subpopulations. Some of these subclones 
are likely to be the primary instigators of invasion, 
metastases or relapse following treatment (35-52). An 
effective characterization of the aggressive potential of 
tumors at early stages has an enormous potential to 
guide new clinical interventions and translational 
research (53-61). 

Recently, several efforts have been made to provide a 
complete genealogical perspective of cancer evolution 
(62-66). Using fluorescent labeling techniques, or more 
recently, single-cell sequencing, it is technically possible 
to separate single cells from tumor samples to investigate 
their evolutionary patterns (62-71). However, these 
approaches are limited to either a small number of fluor- 
escent markers (63,72) or to a relatively small number of 
single cells. On one hand, the identification and selection 
of uncharacterized subclones in high-throughput experi- 
ments is beyond the capabilities of current cell-sorting 
technologies; on the other hand, isolation and profiling 
of enough single cells to obtain a statistically representa- 
tive sample of a tumor composed of millions of cells has, 
currently, prohibitive costs. For this reason, genomics 
profiling of tumors still relies on pooling to provide 
global averaged signals over the subclonal population 
within a tumor sample (73-76). Computational methods 
for identifying subclones, quantifying their relative abun- 
dance and monitoring their emergence and dynamics 
could prove extremely useful for the analysis of the het- 
erogeneity of these pooled samples. This problem has been 
often overlooked due to its mathematical complexity. 

We present a mathematical approach to de-mix signals 
from heterogeneous cell populations into their subclonal 
components and subsequently unveil the underlying 
dynamic tumor heterogeneity. Our proposed method is 



*To whom correspondence should be addressed. Tel: +1 203 737 6262; Fax: +1 203 785 6486; Email: yuval.kluger@yale.edu 
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors. 
© The Aulhor(s) 2013. Published by Oxford University Press. 

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.Org/licenses/by/3.0/), which 
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. 



el65 Nucleic Acids Research, 2013, Vol. 41, No. 17 



Page 2 of 15 



related to the problem of blind source separation (77-86), 
where both the underlying sources and their relative com- 
position are unknown. In contrast to blind source separ- 
ation methods, we cannot assume that the underlying 
sources are statistically independent, we have no prior 
knowledge of the number of sources and we have at our 
disposal only one mixture of the unknown sources. This 
mathematical problem has a vast number of solutions and 
can be addressed only if additional constraints are 
imposed. Solutions to this problem can be found by 
applying Bayesian methods such as hierarchical Dirichlet 
Processes (66,87). While such approaches typically 
produce plausible solutions to the problem, they require 
knowledge of several parameters and prior distributions, 
which are often not easy to calibrate. Futhermore, sto- 
chastic methods are not guaranteed to find the optimal 
solution(s) to the problem and may miss many solutions. 
Herein, we introduce biologically meaningful constraints 
to dramatically reduce the number of solutions to the 
problem, and we provide an algorithm to find all solutions 
of this reduced problem. In detail, we assume that tumor 
cell populations develop in a parsimonious evolutionary 
process. Furthermore, based on empirical observations, 
we introduce a sparsity constraint that limits the number 
of subpopulations. Distinctively from the standard 
problem of phylogeny (88-99), where each species is 
observed and measured separately, and differently from 
cases where multiple aggregate samples have been 
measured (100-106), our methodology, which we term 
Tree Approach to Clonality (TrAp), is specifically 
designed to deconvolve a single aggregate signal into its 
different subclonal components. TrAp incorporates bio- 
logically motivated constraints, which allow us to infer 
(i) the composition of the subclones in a single aggregate 
sample, (ii) the abundance of each subclone and (iii) the 
evolutionary path of the subclones. 

The article is organized as follows: we first define the 
subclonal deconvolution problem and we present an effi- 
cient algorithm for finding all its solutions in the 'Results' 
section. Using in silico simulated data we verify that the 
algorithm is able to correctly deconvolve mixed subclonal 
populations and that the method is robust to realistic 
measurement errors. Further, the solution is often 
unique when the number of populated subclones is 
moderate. In addition, we also show that TrAp can cor- 
rectly deconvolve random mixtures of karyotypes of 
several cells from the same tumor biopsy or from 
mixture of sequences generated in a study involving 
somatic hypermutations (SHMs) in B cells. We subse- 
quently compare the mutation profiles of 20 Exome-Seq 
single-cell experiments to those inferred using an aggregate 
signal generated by exome sequencing from the same renal 
cell carcinoma tumor. We then apply the TrAp algo- 
rithm to study the response to chemotherapy of eight 
acute myeloid leukemia (AML) patients. Finally, 
we apply TrAp to Exome-Seq data from three meta- 
stases from three distinct body compartments and 
compare their subclonal compositions and evolutionary 
histories. 



The subclonal deconvolution problem 

We consider a population of cells where each cell can be 
described by a binary vector, which we call genotype. Each 
element of the genotype vector has an aberration state 
modeled as a binary variable, e.g. the presence/absence 
of a mutated nucleotide in a specific genomic position or 
the presence/absence of a specific copy number variation 
event in a specific locus. For each cell, the z'-th element of 
the genotype vector is 1 if the z'-th aberration is present in 
the cell and 0 if the aberration is absent. A subclone is a 
collection of all cells that have identical genome-wide 
aberration profile. A subclone is populated in the sample 
if the fraction of cells sharing the subclone's genome is >0 
and can be detected by the experiment. 

We define the subclonal deconvolution problem as the 
task of de-mixing an aggregate measurement into a 
linear combination of (unknown) subclonal genotypes. 
The only information that is required as input is the ag- 
gregate frequency vector y, whose elements y,- correspond 
to the frequency of the z'-th aberration in the sample cell 
population. For efficiency, we remove from the genome 
all genotype entries k that are homogenous within the 
population (i.e. = 0 or = 1), as they do not need to 
be deconvolved. Next, to ensure that the aberration-free 
noncancerous cells (wildtype) are included in the solution 
of the problem, we add one dummy aberration to all the 
normal and cancerous cells in the sample. By construction, 
the aggregate frequency of this dummy aberration y x is 
equal to 1. Finally, without loss of generality, we sort 
the aggregate frequency vector y in descending order 
such that 1 = y\ > yi > . . . > }'n > 0, where N is equal 
to the number of aberration events considered including 
the dummy aberration y\. The subclonal deconvolution 
problem can be written in matrix notation as 

y = Cx, (i) 

where C is a matrix of size N x M whose elements cy are 1 
if aberration i is present in subclone C Jr and 0 otherwise; N 
is the size of the vector y; M is the number of subclones; 
and x is a vector of size M where each element Xj corres- 
ponds to the frequency of subclone C, in the sample. We 
note that, without introducing the wildtype aberration, the 
wildtype subclone would correspond to a vector of zeros 
and we would not be able to infer the frequency of the 
wildtype component using the linear model of Equation 
(1). Furthermore, because the dummy aberration is 
present in the wildtype and all other subclones, it 
follows that (i) V/, cy = 1 and (ii) ^j x j— !• We note 
that because M, C and x are all unknown in this 
problem, there is an intractable number of possible solu- 
tions. As previously discussed (107), for M > 2, the system 
is underdetermined and the aggregate signal cannot be 
uniquely deconvolved by solving the linear system, and 
it is not even possible to find parsimonious unique solu- 
tions using sparse reconstruction methods. However, by 
introducing additional biologically motivated constraints 
to the model, we can dramatically reduce the number of 
possible solutions, such that the problem may have a 
tractable number of optimal solutions, ideally only one. 
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Figure 1. A schema of deconvolution of the mixed signal of four 
subclones. In this example, the aggregate signal frequency vector y 
on the left side of the matrix-vector equation represents the frequency 
of five aberrations in the aggregate sample. To allow the heterogeneous 
mixture of subclones to include normal cells we introduce a dummy 
aberration that is present in any cell. The frequency of the dummy 
aberration y, is equal to one. The frequencies of the actual five aber- 
rations A 2 , A 3 , A 4 , A 5 and A 6 encoded in the remaining elements of the 
vector Y are given by y 2 = 0.6, yi = 0.4, 14 = 0.35, ys = 0.3 and 
_1'6 = 0.1, respectively. In this example, the optimal TrAp solution is 
unique and has four populated subclones: C 2 with aberrations {A 2 }, 
C4 with aberrations {A 2 ,A4}, C5 with aberrations {A 3 ,As} and C(, 
with aberrations {A 3< A6}. The optimal solution is shown both as an 
evolutionary tree (left) and in matrix form according to Equation (1) 
(right), where the tree topology is encoded in the binary matrix and the 
relative composition of the subclones is represented in the column 
vector. 

We therefore seek the family of solutions (TrAp solutions) 
that sequentially satisfy the following constraints: 

(1) Evolutionarity. The subclones are generated in an 
evolutionary process starting from a subclone with 
no aberrations. Every subclone is generated from 
an existing subclone by adding to it a single new 
aberration event. 

(2) Parsimony. The number of subclones M that are 
generated during the evolution process is minimal. 

(3) Sparsity. The number of populated subclones P (i.e. 
the number of subclones j for which Xj > 0) is minimal. 

(4) Shallowness. The depth of the evolutionary tree 
(i.e. the number of generations) D = max, (YJ,- eg) is 
minimal. 

A schema of a TrAp solution is shown in Figure 1. 

The evolutionarity constraint is used in many biological 
systems, in particular when studying development of cell 
populations (90-97,100-102). The parsimony constraint is 
typically satisfied because the expected probability of a 
specific aberration event in a nucleotide or a locus is 
low and it is therefore unlikely that an event occurs 
more than once and independently in distant subclones. 
This constraint is the main criterion used to determine the 
optimality of maximum parsimony methods commonly 
used in phylogeny (88,89,93,98,105). The parsimony 
constraint dramatically reduces the number of possible 
solutions of Equation (1) because it limits the number of 
subclones M. The sparsity constraint is justified by the fact 
that some subclones may die out or may be too rare to 
be detected. Also, it has been shown in several studies 
that few subclones acquire an evolutionary advantage 
and outgrow the other subclones (5,12,108-112), thus 
reducing the number of populated subclones. The shallow- 
ness constraint is justified as strong genomic instability 
may not be viable, thus evolutionary trees tend to be 
shallow and wide rather than deep and narrow. 



MATERIALS AND METHODS 

In this section, we describe the procedures used to prepro- 
cess input data for the TrAp algorithm. 

Cytogenetic data 

The cytogenetic data were obtained from the Mitelman 
database, which contains 61846 biopsies as on 15 
August 2012. We accessed the database through the 
Cancer Genome Anatomy Project (CGAP) Web site 
(113), and we filtered out 29 842 biopsies with uncertain 
calls (indicated by a '?' in the karyotype data). We focused 
our search only on aberrations that are binary by nature, 
namely chromosome deletions and translocations. For 
each biopsy, we performed 100 in silico simulations in 
which we mixed the subclones using random nonnegative 
coefficients. 



SHM data 

SHM sequencing data were derived from B cells undergo- 
ing SHM, a process that leads to high-affinity antibody 
molecules (114,115). In detail, we analyzed sequences 
of the V(D)J region extracted from the same germinal 
center from the sample 1 1930dl6_4print.2, which was 
sequenced by Anderson et al. (116,117). The sequences 
were aligned using the alignment tool of the international 
ImMunoGeneTics information system® (IMGT) 
(118,119) to properly align the V, D and J regions. We 
selected eight sequences that were aligned to the same 
V(D)J sequence Vj(Di)Ji. Because these sequences are 
from the same germinal center and align to the same 
V(D)J sequence, they are expected to stem from a single 
naive B cell and have evolved through the SHM process. 
From the sequencing experiment, 20 mutated nucleotides 
were identified. Furthermore, polyallelic mutations 
were found at position 170, where both A — > C and 
A — >■ G mutations were observed. Next, we considered 
the seven unique sequences (sequences five and eight 
were identical) as representatives of the genome of seven 
different subclones. 



Exome capture sequencing data 

Exome-capture data (120) were obtained from a recent 
clear cell renal cell carcinoma (ccRCC) study (64) and 
from the melanoma patient YUHUY of the Yale 
cohort, for which DNA from normal circulating lympho- 
cytes and three tumor metastases (TM1, TM3 and TM4) 
were subjected to exome-capture Illumina Hi-Seq 
sequencing (121). 

Exome-Seq reads from the aggregate samples of the 
ccRCC patient were combined and aligned to the human 
reference genome (assembly hgl9) using Bowtie (122) with 
parameters '-n3 -kl -m20 -132 — chunkmbs 1024 --best 
—strata 1 . The frequency and coverage of each point 
mutation was computed using VarScan (123). We 
further selected the mutations that were validated by Xu 
et al. (64) by polymerase chain reaction (PCR) validation 
(Supplementary Table S3B) and by bioinformatics 
analysis (Supplementary Table S3A), whose genomic 
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coordinated were realigned to the assembly hgl9 using the 
Lift-Over tool of Galaxy (124). 

For the melanoma patient YUHUY (121), we selected 
19 mutations that were populated in the normal sample 
(i.e. y > 0), had high sequence coverage (i.e. >200 reads 
covering the specific nucleotide) and were localized on 
chromosome 18. 



RESULTS 

The results are divided in four parts. In the first part, we 
describe our novel TrAp algorithm for subclonal decon- 
volution of aggregate genomic signals consisting of aber- 
rations' frequencies (e.g. mutational allele frequencies). 
We show that the TrAp algorithm always identifies at 
least one solution. Further, we incorporated into TrAp 
an error model to handle noisy input data as well as an 
extension for handling situations where each locus can be 
affected by distinct consecutive aberrations (e.g. nucleo- 
tides which can undergo several consecutive mutations 
such as A G -* T, or C -+ T ->• G -* A). In the 
second part, we simulate noisy aggregate signals con- 
structed by random linear combinations of simulated 
subclonal aberration profiles. We used these simulated 
data to show that TrAp can correctly deconvolve 
mixtures of evolutionarily related subclones even in 
presence of levels of noise that are typically found in 
current genomics experiments. In the third part, we 
generated realistic aggregate signals by mixing subclonal 
genomic profiles obtained from cytogenetic analyses using 
random coefficients. We generate these data separately for 
each patient and show that, for nearly all aggregate 
samples, TrAp recovered the subclonal components. 
Similarly, we generated realistic aggregate signals from 
somatic hypermutated (SHM) regions from B cells. As 
we show, SHM is a particularly suitable system for the 
framework of our TrAp algorithm, which successfully re- 
covered all components from the aggregate signals. In the 
fourth part, we apply our approach to exome-sequencing 
experiments. We present an analysis of recent single-cell 
exome sequencing from a renal cell carcinoma study 
where, besides a collection of 20 single cells, an aggregate 
has also been measured. Despite the reported difference 
between the aggregate and mean aberration profile of the 
single-cell experiments, TrAp could still identify subclones 
with co-occurring aberrations consistent with some of the 
co-occurring aberrations found in direct single-cell meas- 
urements. We then apply the TrAp algorithm to study 
the response to chemotherapy of eight AML patients by 
comparing the subclonal composition before and after 
treatment. Finally, we analyze three metastases from 
separate body compartments of a melanoma patient and 
compare their inferred evolutionary patterns in a genomic 
region surrounding the Deleted in Colorectal Cancer 
(DCC) gene. 

A brute-force algorithm for solving the subclonal 
deconvolution problem 

To develop a fast algorithm to solve the subclonal decon- 
volution problem, we first derive some useful properties 



that every TrAp solution must satisfy (see Supplementary 
Note A). First, we note that the evolutionarity and 
sparsity constraint imply that the evolutionary trees 
must have exactly N clones. We term such a solution an 
A-solution and its evolutionary tree an A-tree. In this 
setting, mutations happen only once during the evolution- 
ary process and cannot be lost at later stages in evolution. 
We therefore can define C, as the subclone for which the 
i'-th aberration occurs for the first time. 

Next, we note (see Supplementary Note B for derivation 
and a more detailed description) that, if we consider two 
subclones C, and Q such that y t > yj (which implies / < j 
because the y vector is sorted), the subclone C, cannot be a 
descendant of C,. This property implies that all evolution- 
ary trees can be generated by a simple iterative procedure, 
which starts from the wildtype clone C\ and adds the 
subclone C, to all trees generated in the step i—l 
(Supplementary Figure S2). The upper bound on the 
number of possible evolutionary TV-trees is thus (N — 1)!, 
as every subclone i can only be the direct descendant of 
i— 1 subclones. This bound is significantly smaller than the 
number of all possible trees with N labeled vertices, which 
is N N ~ 2 according to Cailey's formula (125,126). 

The TrAp algorithm 

We now rewrite the subclonal deconvolution problem (see 
Supplementary Note C for derivation) as follows: 

x = y - 4>y, (2) 

where <t> is the parent indicator matrix, whose element <j>g 
(which we call parent indicator variable) is 1 if C, is the 
parent of Cj (i.e. if subclone C, is the result of a single 
aberration event in subclone C,), and 0 otherwise. An im- 
portant corollary of this equation is that the subclone C t is 
not populated if and only if 

N 

./=1 

In other words, the clone C, is not populated when the 
aggregate frequency y,- of aberration i is equal to the sum 
of the aggregate frequencies of all the direct descendants 
of the subclone Q. Therefore, the number of non- 
populated subclones of the TV-tree encoded by $ is 
determined by the number of aberrations i that satisfy 
Equation (3). Moreover, to satisfy the sparsity constraint 
of a solution, we do not need to know the topology of the 
whole evolutionary tree, but only the subset of rows of the 
matrix 4> that satisfy Equation (3). We now leverage on 
this property to efficiently generate sparse solutions to the 
subclonal deconvolution problem. 

First, we group each subset of subclones that satisfy 
Equation (3) into a first-generation tree T h which we 

define as the subset of subclones JcJ',Cf', . . . ,Cjy.J such 

that the subclone Cj' is not populated (i.e. x T p * — 0) and 

that N t subclones Cf', . . . ,C^'. are its direct descendants. 
Each first-generation tree is represented by a row of the <I> 
matrix. For example, there are three first-generation trees 
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Figure 2. Identification of first-generation trees. In this example, the 
aggregate signal frequency vector y = [1,0.6,0.4,0.35,0.3,0.1] is consist- 
ent with three first-generation trees T\ = {C\,C2,C^}, Tj = {C\,C2, 
Cs,C(,) and T$ = {C^,Cs,C6}. Each first-generation tree is visualized 
as a matrix equation X = Y — <3> Y according to Equation (2) (left) 
and as a partial evolutionary tree (right). In the bottom row, the 
partial tree PT\ given by the union of the partial trees T\ and T 3 is 
shown. Question marks indicate values that are unknown as they are 
not specified by the first-generation tree or by the partial tree. 



for the aggregate signal 7= {1,0.6,0.4,0.35,0.3,0.1}, 
namely T\ = {Ci,C 2 ,C 3 }, T 2 = {Ci,C2,C 5 ,C 6 } and 
r 3 = {C3,C5,Ce} (Figure 2). We note that the optimal 
TrAp solution for this example contains the first- 
generation trees T\ and T 3 (Figure 1). Furthermore, a $ 
matrix associated with a first-generation tree must follow 
the evolutionary constraints previously described 
(V/> 1, Y^i=\^ii = 1)' an d thus, the first-generation tree 
also gives complete information about the columns of 4> 
corresponding to the direct descendant subclones Cf', . . . , 
Cjf.. For example, the first-generation tree T\ = {C\, 
C2,C$} implies that fa^ = 0 and ^3 = 0 for every i ^ 1 
(Figure 2). 

Next, we define a partial tree as a collection of first- 
generation trees {T\, . . . ,7),} that can jointly be contained 
in a full evolutionary tree. Because each first-generation 
tree can be represented by a row of the <£> matrix, a partial 
tree that is comprised of h first-generation trees can be 
represented by h rows of the 4> matrix. In the example 
above, the partial tree that contains the first-generation 
trees T { and T 3 is represented by rows 1 and 3 of the 
matrix 4> in the bottom panel of Figure 2. Similarly, to 
first-generation trees, the matrix <J> associated with a 
partial tree must follow the evolutionary constraint, 
which implies that not all combinations of first-generation 
trees give rise to partial trees. In the example above, the 
partial trees T\ and T 3 can be combined to generate the 
partial tree PT\ = {T\,T 3 } (Figure 2 bottom), whereas 
the pairs {T\,T 2 } and {12,73} cannot be combined to 



generate partial trees. Therefore, in the example above, 
the possible partial trees are PT\ = {T\,T{\, the empty 
partial tree PT 2 = {} and the partial trees PT3 = {7*i}, 
PT A = {T 2 } and PT 5 = {T 3 }. Moreover, we note that all 
iV-trees that contain a partial tree comprising of h first- 
generation trees have N — h populated subclones. This 
implies that TrAp solutions, which must satisfy the 
sparsity constraint, must also contain one of the partial 
trees comprising the maximum number of first-generation 
trees. In the example above, the optimal TrAp solution 
(Figure 1) contains the partial tree PT\, which is the 
only partial tree comprising two first-generation trees. 

All TrAp solutions contain the maximum number of 
first-generation trees, thus the TrAp algorithm dramatic- 
ally reduces the search space by identifying the optimal 
partial trees which are then used as starting points for 
rapidly building all the sparsest solutions to the subclonal 
deconvolution problem. In detail, the TrAp algorithm 
solves the subclonal deconvolution problem as follows 
(Figure 3): 

(1) Identify all first-generation trees from the aggregate 
signal vector y. 

(2) Combine all first-generation trees to generate all 
partial trees. 

(3) Discard partial trees that do not have the minimum 
number of populated subclones. 

(4) Generate all evolutionary trees consistent with the 
partial trees comprising the maximum number of 
first-generation trees. This step is done iteratively 
for each partial tree, similarly to the way described 
for the brute-force algorithm. The only difference is 
that, when the parent clone C T p l of a first-generation 

tree T t = JcJ'.Cf', . . . ,C^.J is added to the tree, the 

subclones cf', . . . ,C^'. are automatically added as 

direct descendants of Cj' . 

(5) Optimize the shallowness constraint by sorting the 
generated solutions by the depth of the generated tree. 



The performance of the TrAp algorithm is equivalent 
to the brute-force approach when there are no first- 
generation trees (i.e. when all subclones are populated), 
but it becomes superior to a brute-force approach when 
P < N. While the brute-force algorithm generates all 
the evolutionary trees compatible with the input data, 
the TrAp algorithm generates only the optimal evolution- 
ary tree(s). 

Handling measurement errors 

The models presented above show that TrAp is an efficient 
algorithm for inferring subclonal components from the 
aggregate measure. In particular, we have shown that in 
the absence of noise, TrAp returns the exact solution when 
the underlying subclonal population satisfies reasonable 
constraints and that the algorithm is always able to find 
at least one solution. However, experimental measure- 
ments are often noisy and can only have finite precision. 

In this section, we discuss two approaches to treat noisy 
input. In both error models, the input to the TrAp 
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Step 1 : 
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A 2 , A 3 , A 4 , A 5 



Figure 3. Illustration of the usage of first-generation trees and partial trees for deriving the TrAp solution of a mixture of four subclones. In this 
example, five aberrations were measured from an aggregate sample and their frequencies were y% = 0.8, y^ = 0.5, ya = 0.5, ys = 0.4 and jf, = 0.2, 
respectively. The dummy measurement y\ = 1 was also added to generate the aggregate signal frequency vector y = [1,0.8,0.5,0.5,0.4,0.2]. In the first 
step, TrAp identifies all first-generation trees, namely T\ = {C\,C2,C(,} and Tj = {C-i,Cn}. In the second step, TrAp generates the possible partial 
trees, namely PT\ = {}, PTi = {Ti}, PT-j = {T2} and PT4 = {T\,Tt)> and consequently selects only PT4 = [T\,T2], as it is the only partial tree that 
contains a maximum number of first-generation trees. In the third step, TrAp generates evolutionary trees starting from the partial tree 
PT4 = {Ti,T2}. To complete the evolutionary tree starting from PT4, the subclone C\ is positioned as the root of the tree. Because C\ is part of 
the first-generation tree T\, the subclones C 2 and C 6 are automatically added as direct descendants of C\. Next, C 3 is added as a direct descendant of 
C2. Because C3 is part of the first-generation tree T2, C4 is automatically added as direct descendant of C3. Finally, C5 is added as a direct descendant 
of C4, generating the optimal TrAp solution to the subclonal deconvolution problem. We remark that the optimal solution generated by the TrAp 
algorithm is equal to the left solution of Supplementary Figure SI and to solution SI in Supplementary Figure S2. 



algorithm requires an additional vector s of size N whose 
elements £,■ are related to the precision of the measure y t . 
The error related to the dummy variable is denoted by £1 
and is set to 0 as y\ = 1 is a constraint of the model and 
thus £1 must vanish. 

First, we examine the bound error model in which we 
assign a threshold to the error £,■ > 0 of every underlying 
aggregate signal % such that each measured signal yi will 
be in the range [yj- — £,,£;+£,•]. Equation (3) is then 
modified accordingly and we can state that the subclone 
C, defined by aberration i is not populated if and only if 

N N 
7=1 7=1 

Next, assuming normally distributed measurement errors 
we implement a normal error model using the confidence 
intervals to determine whether a subclone is populated. 
Specifically, we assume that the underlying aggregate 
signal is normally distributed around the observed 
signal, i.e. y t ~ M\yu^f). We substitute each term of the 
left-hand side of Equation (3) by normally distributed 
random variables to derive the distribution of the 
random variable r ~ yV(/x,.,a-, 2 ) with mean 
li- r =%- E y =i fajyj and variance 07 = e}+ J^jLi fatf- 
Using the distribution of r and a desired confidence level 
a > 0 (default 0.05), we can define that clone C, is not 
populated if 0 falls within the confidence interval 
[q«, <7i^J, where q a is the a quantile of the distribution of r. 



Once the error model is chosen, the algorithm generates 
all optimal TrAp-trees in a similar fashion to the noise- 
free case. The main difference is that in the first step of the 
TrAp algorithm, Equation (4) [or a confidence interval on 
the distribution r ~ A/"(/x,-,cr ( 2 )] is used instead of Equation 
(3) to identify first-generation trees. Moreover, instead of 
using back-substitution for finding the vector x, we solve 
the nonnegative linear least square problem Cx = y with 
the constraint Xk = 0 for all nonpopulated subclones k 
associated with the parents of the first-generation trees. 
This fitting allows us to obtain a value of exactly zero 
for all nonpopulated subclones and to distribute measure- 
ment error more evenly in the vector x. 

Extensions and integrations of the TrAp algorithm 

The TrAp algorithm was also generalized to deal with 
nonbinary aberrations (e.g. multiple distinct point muta- 
tions at the same nucleotide). This has been done by 
reducing this generalized nonbinary problem to multiple 
binary problems that can be solved using the core TrAp 
algorithm (detailed explanation and derivation are given in 
Supplementary Note D, Supplementary Figures S3 and S4). 
This generalized model was used to infer subclonal compos- 
ition from a mathematical mixture of the SHM data. 

Furthermore, the algorithm can be easily modified by 
imposing additional constraints that need to be satisfied at 
each step of the iterative tree reconstruction procedure. 
The contraints can be used to specify the order in which 
two mutation occur or whether two aberrations must be 
on separate evolutionary branches (i.e. they will never co- 
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occur). Such constraints are used in the extension of TrAp 
to nonbinary aberrations (Supplementary Note D). These 
constraints can also be specified when additional informa- 
tion is available to the user. For example, the aberration 
state of two nearby nucleotide positions could be observed 
simultaneously in a read pair. This additional information 
can be used to determine if two mutations are mutually 
exclusive or if one is an ancestor of the other (66). 
Moreover, if multiple samples are available for a given 
patient, a unique evolutionary tree inferred from one 
sample can be used to constrain the evolutionary trees 
of the remaining samples. 

The TrAp algorithm by default returns only the solu- 
tion^) that optimize all the constraints. In addition, the 
user can specify parameters to relax the sparsity and shal- 
lowness constraints. For example, all TV-solutions whose 
number of populated subclones is less than or equal to a 
desired number can be obtained by retaining more partial 
trees during the third step of the TrAp algorithm. The so- 
lutions produced by TrAp (or the brute-force approach) 
can then be rescored by more advanced user-defined 
fitting functions to refine the results. These fitting 
function may include terms that model the biological 
system under consideration (77) [e.g. some types of aberra- 
tions are more common during SHM (115) or during 
melanoma development in melanocytes (127)] or terms 
that model the sampling noise of a given experiment. The 
TrAp algorithm was used to deconvolve systems of up to 25 
aberrations. Although many tumors have larger number of 
nonsynonymous mutations, the effective number relevant 
for analyzing the tumor can be significantly reduced. This 
can be done by (i) considering only a subset of medically 
relevant genes, e.g. by selecting the first tier defined by 
Mardis et al. (128) or by focusing on expressed genes 
whose mutations are predicted to be deleterious in 
proteins or genes that are downregulated relative to 
normal tissue, (ii) focusing on mutations within selected 
pathways or (iii) clustering all mutations into groups with 
similar minor allele frequencies. These reduction 
approaches allow to identify meaningful aberrations and 
thus generate trees that are simpler and easier to interpret. 
Furthermore, studying a smaller number of mutations may 
be more robust to error and may allow to identify outliers 
and artifacts in the input data. In the TrAp algorithm, we 
include a clustering procedure that groups together aberra- 
tions with similar frequency (according to the error model 
chosen) before running the algorithm. More complex clus- 
tering methodologies can be applied if replicate samples 
are available or if multiple samples from the same patient 
are available. For example, Ding et al. (18) applied 
MCLUST (129,130) and clustering based on Kernel 
Density Estimation to identify three to five major clusters 
of minor allele frequencies in three conditions (normal, 
tumor, relapse) for eight AML patients. Below, we also 
reanalyzed Ding et al. (18) data using the minor allele 
frequencies of the clusters as input to the TrAp algorithm. 

Implementation of the TrAp algorithm 

TrAp was programed in Java. TrAp makes use of the Java 
Matrix package JAMA (131) for linear regression and 



code by Josh Vermaas to solve the nonnegative least 
square problem using JAMA. The Java Universal 
Network/Graph Framework (132) is used for creating pic- 
torial representations of evolutionary trees. TrAp is 
released under the GNU Lesser General Public License 
3.0 and can be downloaded from the SourceForge reposi- 
tory at the URL http://sourceforge.net/projects/klugerlab/ 
files/TrAp/. 

Deconvolution of simulated noisy aggregates 

To confirm that TrAp can correctly infer the subclonal 
composition from aggregate noisy signals with typical 
noise levels found in genomic experiments, we performed 
simulations starting with random in silico evolutionary 
trees with different numbers of aberrations N and different 
numbers of populated subclones P. For each tree, we also 
studied the effect of different magnitudes of measurement 
errors E and we investigated the operating conditions for 
which TrAp would correctly identify the true solution. 

We performed simulations by sampling genotypes 
whose size N ranged from 1 to 12 and with underlying 
number of populated subclones P ranging from 1 to 
TV — 1. The simulations were repeated for measurement 
error values E equal to 10" 2 , 10 -3 and 10~ 4 . For each 
combination of these quantities, we performed 1000 runs 
using in silico data as follows: during each run, a random 
evolutionary tree with N aberrations was generated by 
randomly assigning a parent subclone C t (j e [0,i — 1]) to 
each subclone Q. The set of P populated subclones was 
then selected by first including all leaves of the tree and 
then adding the remaining subclones randomly. The fre- 
quency of the populated subclones was randomly assigned 
and the frequency of the nonpopulated clones was set to 
zero. Next, the aggregate frequency vector y was 
calculated from the generated tree. Finally, we perturbed 
each element of y\ by adding an error s, drawn from a 
uniform distribution U{— E,E). The elements j,- = yi+Ej 
and the error s, are used as input for the bound error 
model option of the TrAp algorithm. For each aggregate 
signal from a random tree, we examined (i) whether 
the true solution (i.e. the solution associated with the 
simulated tree), which is by construction one of the 
possible solutions to the subclonal deconvolution 
problem, had the minimum number of populated 
subclones among all solutions (sparsity constraint), 

(ii) whether the true solution had the minimum number 
of populated subclones and minimum number of levels of 
the evolutionary tree among all solutions generated 
by TrAp (sparsity and shallowness constraints) and 

(iii) whether the true solution was the only TrAp 
solution (sparsity and shallowness constraints and unique- 
ness of the optimal solution). 

The results of the simulations show that aggregate 
signals from sparse trees are deconvolved correctly even 
in presence of typical noise levels of sequencing experi- 
ments (Figure 4). We note that for simulations of 
nonsparse trees, TrAp generates a large number of 
possible solutions of which only one is the true solution. 
Furthermore, in the presence of high levels of noise, the 
TrAp algorithm identifies a large number of first- 
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generation trees that satisfy Equation (4) and generates 
solution trees whose number of populated subclones is 
smaller than the number of populated subclones of the 
true solution. 



Analysis of simulated mixtures of biological data 

Deconvolution of mathematical mixtures of karyotyping 
data from single tumor biopsies 

After showing that our approach can correctly deconvolve 
aggregate signals of subclones with a tree-like genealogy, 
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Figure 4. Deconvolution of simulated data. In each table the index of 
a column represents the number of populated subclones and the index 
of a row represents the number of mutations. We generated 1000 
simulations for any pair of row and column indices (pixel) in these 
tables. We performed this analysis using different level of noise 
(error) drawn from a uniform distribution U{— E, E). The heatmaps 
(tables) show the percentage of trees in each cell for which the true 
solution has the minimum number of subclones (left panel), is a TrAp 
solution (middle panel) and is the only TrAp solution (right panel) if 
the best solution is unique. 



Table 1. Applicability of the TrAP algorithm for different number of 
aberration events and underlying subclones 
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100% (19078) 
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50% (2) 


n/a 


100% (1) 


14 


100% (94) 


100% 


(12) 


n/a 


100% (1) 
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100% 


(22) 


57% (7) 


25% (4) 
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Entries where the parsimony constraint cannot be satisfied are shown 
in bold. 



we sought to investigate whether actual subclonal popu- 
lations can be charted on evolutionary trees. For this 
purpose, we analyzed the Mitelman database, consisting 
of cytogenetic analyses of > 60 000 biopsies (see 'Materials 
and Methods'). For each tumor type, we counted how 
frequently the relationships between cancer clones from 
the same biopsy could be explained by an evolutionary 
tree that follows the evolutionarity and parsimony con- 
straints (but not necessarily the sparsity and shallowness 
constraints). We found that almost all biopsies in the 
Mitelman database can be represented by evolutionary 
trees (Table 1), with the notable exception of astrocytoma 
of grades III and IV (Supplementary Figure S5). 

We mathematically mix all karyotypes of each single 
patient from the Mitelman database and apply the TrAp 
algorithm for each of these mixtures. The ability of the 
TrAp algorithm to extract the correct underlying clonal or 
subclonal components depends on the number of actual 
components (columns) and the multiplicity of aberrations 
studied in each mixture (rows). The frequency in which 
TrAp is able to recover the correct underlying components 
is shown in percentages. The number of mixtures for a 
given size of aberration multiplex (row) and given 
number of actual underlying components (column) is 
shown in parentheses. Note that when the column index 
is greater than the row index (entries shown in bold), the 
parsimony constraint cannot be satisfied. 

Next, we investigated whether the TrAp algorithm 
could uniquely deconvolve mixtures of the cancer 
subclones within a biopsy. As these aggregate signals are 
obtained by mixing actual subclonal profiles, we consider 
these signals to be more realistic than our previous in silico 
simulations. For each biopsy, we generated 1000 in silico 
mixtures by combining the cytogenetic profiles of each 
subclone using random nonnegative coefficients. We 
then applied our TrAp method to deconvolve in silico 
mixtures of biopsies. Our results (Table 1) show that 
81.5% of the aggregate signals simulated from biopsies 
with three or more subclones were correctly deconvolved 
(i.e. in at least one TrAp solution the subclones contained 
in the biopsies were found and were present in the correct 
proportions) and that in 67.3% of these simulations there 
was only one TrAp solution to the deconvolution 
problem. Moreover, the TrAp algorithm inferred also 
intermediate nodes in the evolutionary tree that did not 
correspond to any of the cytogenetic profiles for the 
biopsy, providing a plausible picture of the evolutionary 
order in which the aberrations occurred. Figure 5 shows 
the result of two deconvolution simulations, one from a 
melanoma sample with two subclonal populations (133) 
and one from an adenocarcinoma sample with three 
subclonal populations (134). Interestingly, a small 
number of biopsies showed more clones than aberrations 
(shown in bold in Table 1). Albeit a tree-like genealogical 
relationship can be constructed, these biopsies do not 
satisfy the parsimony constraint because the number of 
subclones M is greater than the number of mutations N. 
For this reason, their genealogy cannot be reconstructed 
by the TrAp algorithm or by any other method that makes 
use of a similar parsimony constraint (88,89,93,98). 
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Deconvolution of a mathematical mixture of SHM data 
with polyallelic mutations in a single nucleotide 

SHM introduces mutations that target the variable regions 
associated with immune adaptivity in the Ig loci. In par- 
ticular, SHM involves a programmed process of muta- 
tions that affects the variable regions of immunoglobulin 
genes and starts from an initial dividing single cell (a naive 
B cell in this case). All descendants of the founder cell 
accumulate mutations and, at the same time, are subjected 
to a strong selective pressure. For this reason, SHM is a 
particularly good biological model system to test our de- 
convolution method, which imposes tree-like evolutionary 
constraints. 
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Figure 5. Deconvolution of random mixtures of three subclones. The 
boxes represent different subclones, each denoted by the list of its 
aberrations. The aberration profiles of two subclones identified by cyto- 
genetics in a melanoma biopsy (left) and the aberration profiles of three 
subclones identified in an adenocarcinoma biopsy (right) have been 
mixed in silico using random coefficients. In both cases, the mixtures 
were successfully deconvolved. Aberrations are grouped within the 
boxes according to the order of occurrence. The reconstructed evolu- 
tionary trees suggest intermediate (white boxes), probably rare, 
subclones that were not reported in the cytogenetic data. 



We considered a data set where 20 mutated nucleotides 
in the V(D)J region of the Ig locus were measured in eight 
sequences extracted from the same germinal center (see 
'Materials and Methods 1 section) (116). This data set was 
particularly interesting because of the high number of mu- 
tations found and because of the presence of polyallelic 
mutations. 

We mathematically mixed the multi-subclonal data and 
applied our TrAp algorithm taking into account that the 
SHM scenario consists of nonbinary aberrations. We 
mixed these subclones using random nonnegative coeffi- 
cients and performed 1000 simulations. In all simulations, 
TrAp was able to recover the original sequences and the 
solution was unique in 65% of the simulations. The TrAp 
solution of one simulation is shown in Figure 6. However, 
even if the solution was not always unique, in >97% of the 
simulations there were only five or less candidate solutions 
satisfying the evolutionary, parsimony and sparsity con- 
straints, all of which correctly identified at least six out of 
seven subclones. 

In addition to the identification of the underlying 
subclones, the TrAp algorithm generates evolutionary 
trees, which represent the B cell lineage during the SHM 
process. The reconstruction of B cell lineage can provide 
important insights into the mechanisms that regulate 
adaptive immunity. B cell lineage reconstruction is gener- 
ally performed using maximum parsimony constraints 
(98) using the sequences of several Ig loci as input. 
However, in contrast to these approaches, the TrAp algo- 
rithm is able to generate maximum parsimony trees when 
only the relative frequency of mutations at each nucleotide 
is available. Therefore, the TrAp algorithm can be used to 
generate parsimonious evolutionary trees when only 
partial sequence information is available, e.g. when only 
short read sequences from a single aggregate sample are 
available, or when the loci analyzed span a region that is 
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Sequence 6 (8.5%) 
34:A-T, 170:A-G, 297:A-C 
347:G-A, 349:A--T 



Sequence 7 (10.9%) 
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21 1 :A--G, 213:A-C, 228:C->T 
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349:A-T 
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145:G^C, 158:G-C, 170:A->G--C 
180:A^G, 349:A-*T 



Sequence 1 (8.8%) 
105:T-»G, 145:G-*C, 157:A-*G 
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Figure 6. Deconvolution of a random mixture of eight sequences from SHM data. Eight sequences from the Ig locus of eight cells extracted from the 
same germinal center were mixed with the random coefficients given by x = [8. 8%,22. 6%, 20. 2%, 7. 9%, 5. 7%, 8. 5%, 10. 9%, 15. 5%]. Since sequences 
five and eight are identical, they are grouped in a single clone whose relative frequency is 5.7 % +15.5 % = 21.2 %. In total, 20 mutated nucleotides 
were found in the data, and two different mutations were identified at position 170. Mutations are shown using the notation 
'position : reference -> mutated', e.g., the notation 170 : A -* G indicates that the nucleotide at position 170 was mutated from Adenine to 
Guanine. The notation 170 : A -+ G -> C indicates that the nucleotide at position 170 was mutated twice, first from Adenine to Guanine and 
then from Guanine to Cytosine. In this example, all seven subclones were correctly deconvolved by the TrAp algorithm, the frequency of the 
subclones was correctly estimated and the solution was unique. 
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too large to be fully sequenced, or when the loci analyzed 
are distributed on different chromosomes (e.g. sequences 
from both Immunoglobulin heavy and light chains). 

Analysis of tumor biopsies 

Comparison between subclonal aberration profiles inferred 
from heterogeneous cell populations and single-cell 
aberration profiles 

We analyzed data from a recent study on renal cell car- 
cinoma where two aggregate samples and 20 single cells 
were isolated from a ccRCC and subjected to exome 
sequencing. Interestingly, the original study only showed 
partial similarity between the single cells and the aggregate 
(64). However, because the single cells and the aggregate 
used in the experiments are from the same tumor, we 
sought to investigate whether any subclones inferred by 
TrAp would share a similar combination of mutations 
found in any of the single cells. 

We applied our TrAp algorithm to the aggregate sample 
and obtained an evolutionary tree consisting of three main 
subclones. Due to the lack of extensive validations, we 
limited ourselves to investigate whether mutations that 
co-occur in the TrAp solution also co-occur in single-cell 
samples. We considered the mutations that were 
validated by bioinformatics analysis [Supplementary 
Table S3A from Xu et al. (64)] and by PCR validation 
[Supplementary Table S3B from Xu et al. (64)]. The 
fraction of correctly estimated co-occurrences was 0.76 
for mutations validated by bioinformatics analysis and 
0.74 for mutations validated by PCR. 

AML data 

Next, we used our TrAp algorithm to investigate the 
clonal evolution of eight AML patients. For each 
patient, three samples (normal, tumor, relapse) were col- 
lected and sequenced by Ding et al. (18). Minor allele 
frequencies of somatic mutations were estimated from 
the sequencing data [Supplementary Tables S5a and 
S6a-g from Ding et al. (18)] and clustered using 
MCLUST for patient UPN933124 [Supplementary Table 
S5c from Ding et al. (18)] or Kernel Density Estimation 
for the other seven patients [Supplementary Table S10 
from Ding et al. (18)]. Since the frequency of each 
cluster was estimated by the median, we used median 
absolute deviation and a default scale factor of 1.4826 to 
estimate confidence intervals under the assumption of an 
underlying normal distribution (135). The aggregate signal 
y t and measurement error s, for each cluster of mutations i 
were then estimated as 

y t = MedianOyg,); £,- = 1.4826 x MAD(y Jei ), (5) 

where yj is the estimated aggregate signal of mutation /'. 

The clonal evolutions inferred by the TrAp algorithm 
(Supplementary Figures S6-S13) are in agreement with 
those inferred by Ding et al., who used deductive reason- 
ing to manually derive the subclonal evolution (18). This 
agreement was expected as all the observations used 
by Ding et al. to generate the evolutionary trees are 
corollaries of our evolutionarity and sparsity constraints 
and are therefore automatically enforced by the TrAp 



method. In addition, the TrAp program listed all evolu- 
tionary trees compatible with the input data and provide 
additional insights on the possible origin of sublclones in 
the relapses of patient UPN758168 (Supplementary Figure 
S7) and UPN452198 (Supplementary Figure S10). 

Melanoma data 

Finally, we applied our algorithm to investigate evolution- 
ary mechanisms in tumor metastases using exome 
sequencing data from three tumor metastases (TM1: left 
lateral chest wall, TM4: pleural cavity and TM3: right 
axilla) and a matched normal sample (N: left lateral 
chest wall) of one melanoma patient (121). TrAp can effi- 
ciently handle aggregate signal vectors of ~20 unique 
frequencies and therefore we perform deconvolution 
analysis only on one chromosome. We selected chromo- 
some 18, as it contains the tumor suppressor DCC gene, 
which is known to exhibit a high load of mutations only in 
melanoma (136), in contrast to low expression, loss of 
heterozygosity or epigenetic silencing in other tumors. 

To apply the TrAp algorithm, we first preprocessed the 
data and selected 19 mutations on chromosome 18 (see 
'Materials and Methods' section). We labeled each 
mutation according to the gene affected and the amino 
acid change caused by the mutation (e.g. the label 
DCC.L1099H indicates a mutation in the DCC gene 
that causes a mutation from a Leucine to a Histidine at 
position 1099 in the DCC protein). There are six muta- 
tions with >99% frequency in all samples (including the 
normal): ADNP2.G54G, ALPK2.I2157V, CD226.S307G, 
DCC.F23L, NETOLS481N and SLC39A6.E119D. The 
only other mutation found in the normal sample was 
TCEB3CL.S301C, which occurs with frequency >90% 
in all samples. Moreover, the mutations ALPK2.R136S, 
CHST9.S122N, FAM38B.V2463, LAMA1.S1577A, 
LAMA1.K2002E, MYOM1.T215M, SERPINB10. 
R246C and SLC14A2.A880T were found in all three 
tumor samples and shared a similar frequency profile. 
The mutations DSC3.A28D, DSG1.M11V and 
IMPACT. D125E were found only in the metastases 
samples TM3 and TM4 and shared a similar frequency 
profile. Finally, the mutation DCC.L1099H was found 
only in the sample TM3. 

Since none of the genomic positions analyzed contained 
polyallelic mutations, we assigned a binary state (normal/ 
mutated) to each selected genomic position and we 
estimated the aggregate signal and measurement error 
for each mutation event using a normal approximation as 



where «, is total the number of reads covering position i, 
and nij is the number of reads with a mutation in position 
i. Finally, the y and e vectors were used as input for the 
TrAp algorithm. We run the TrAp algorithm using the 
normal error model option. 

Independent runs on the three metastatic samples gave 
33 optimal solutions for TM1, 222 for TM3 and only 1 
TrAp solution for TM4. These high numbers of solutions 
are due to the substantial noise of the experiment (in the 
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Tumor Metastasis TM1 

left lateral chest wall 



C, (6%) 

ADNP2.G54G , ALPK2.I2157V 
CD226.S307G, DCC.F23L 
NET01.S481N, SLC39A6.E119D 



Tumor Metastasis TM4 

pleural cavity 



C, (4%) 

ADNP2.G54G , ALPK2.I21 57V 
CD226.S307G, DCC.F23L 
NET01.S481N, SLC39A6.E119D 



Tumor Metastasis TM3 

right axilla 



C, (8%) 

ADNP2.G54G , ALPK2.I2157V 
CD226.S307G, DCC.F23L 
NET01.S481N, SLC39A6.E119D 



C 2 (0%) 
TCEB3CL.S301C 
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C 3 (48%) 

ALPK2.R136S, CHST9.S122N 
FAM38B.V2463I , LAMA1.S1S77A 
LAMA1.K2002E, MYOM1 T215M 
SERPINB10.R246C, SLC14A2.A880T 



C 3 (77%) 

ALPK2.R136S, CHST9.S122N 
FAM38B.V2463I , LAMA1.S1577A 
LAMA1.K2002E, MYOM1.T215M 
SERPINB1 0.R246C, SLC14A2.A880T 



C 4 (19%) 
DSC3.A28D, DSG1.M11V 
IMPACT.D125E 



C 3 (48%) 

ALPK2.R136S, CHST9.S122N 
FAM38B.V2463I , LAMA1.S1577A 
LAMA1.K2002E, MYOM1.T215M 
SERPINB1 0.R246C, SLC14A2.A880T 




C 4 (0%) 
DSC3.A28D, DSG1.M11V 
IMPACT.D125E 



C 5 (44%) 
DCC.L1099H 



Figure 7. Evolutionary trees inferred from three metastases of a melanoma patient. Each subclone in these trees is represented by a box with a list of 
mutations that includes only its new mutations (ancestral mutations can be read off by tracing back the mutation lists of all of its ancestors). 
Mutations are labeled according to the gene affected and the amino acid change caused by the mutation (e.g. the label DCC.L1099H indicates a 
mutation in the DCC gene that causes a mutation from a Leucine to a Histidine at position 1099 in the DCC protein). Highly expressed genes from 
this patient are indicated in bold. Mutations in the left branch of TM4 are more abundant than in TM1 and TM3. 44% (19%) of the subclones of 
TM3 (TM4) have mutations in DSC3, DSG1 and IMPACT. The TM3 subclone has an additional mutation in DCC. 



range 0.005-0.025) and the fact that in samples TM1 and 
TM3, two of the subclones have similar frequencies and 
are thus difficult to separate from one another. However, 
TrAp identified a unique solution in the sample TM4, 
where the three populated subclones are distributed with 
significantly different frequencies (Figure 7 middle). Next, 
we reasoned that the metastatic TM1, TM3 and TM4 
samples may share common ancestors and that their evo- 
lutionary profiles may be related. We then applied our 
TrAp algorithm while also imposing that all evolutionary 
trees must be a subset of one global evolutionary tree. This 
approach returned a unique solution for each sample, all 
of which were among the solution sets identified in the 
previous analyses. We observe that this approach can be 
very powerful because, in principle, it allows the recon- 
struction of large trees by combining several snapshots of 
the related subclonal populations. 

The results of the deconvolution are shown in Figure 7. 
We observe the presence of two main branches. Mutations 
in the left branch of TM4 (77%) are more abundant 
than in TM1 and TM3 (48%). We note that Laminin, 
alpha 1 (LAMA1), a protein that is involved in cell 
adhesion, is present in the right branch. 44% (19%) of 
the subclones of TM3 (TM4) have mutations in 
Desmocollin-3 (DSC3), Desmoglein-1 (DSG1) and 
Impact RWD Domain Protein (IMPACT). The TM3 
subclone also acquires a second mutation in the DCC 
gene (DCC.L1099H-L) in addition to the mutation 
DCC.F23L, which was hereditary. The novel mutation 
in the DCC gene occurs close to the boundary between 



the extracellular domain and the transmembrane domain 
of the protein product. The resulting Histidine amino acid 
is positively charged, opposed to the Leucine amino acid 
of the wildtype, which is neutrally charged. Since this 
change is next to the cell membrane, it may have reper- 
cussions on the functionality of the DCC protein product, 
perhaps causing inactivation, similar to the inactivation 
caused by loss of heterozygosity and transcript suppres- 
sion observed in other cancer types. 

DISCUSSION 

In the present study, we described the TrAp algorithm, a 
tool for inferring subclonal composition and abundance 
from a single aggregate measurement experiment. As we 
have shown, TrAp is robust to noise and it can deconvolve 
mixtures where multiple mutations occur at the same 
locus. TrAp efficiently enumerates all possible solutions 
that are compatible with the model constraints, thus 
always identifying the sparsest and most parsimonious 
solution(s). However, TrAp will also generate trees [cf. 
Supplementary Equation (SI)] in cases where no tree 
structure can be inferred. As we have shown, such struc- 
tures, while deviating from the true underlying population 
structure, can still capture relevant co-occurrences of mu- 
tations that are specific to certain subclones. Further, in 
contrast to parsimonious neighbor-joining approaches, 
which rely on sampling single subclones from the popula- 
tion (e.g. single-cell experiments), TrAp uses aggregate ex- 
periments as input, thus overcoming the issue of small 



el65 Nucleic Acids Research, 2013, Vol. 41, No. 1 7 



Page 12 of 15 



sampling size, which may be insufficient to cover the 
whole spectrum of subclones in a sample. We successfully 
deconvolved systems of up to 25 aberrations. Although 
this number is not large enough to consider all somatic 
mutations found in a tumor sample, this problem can be 
circumvented by clustering aberrations with similar 
frequencies before running the TrAp algorithm. 

The level of characterization achieved by subclonal de- 
convolution holds high potential for personalized 
therapies. Possible applications include the classification 
of subclones in primary tumors, the identification of the 
seeds of metastases, tracing of resistant subclones espe- 
cially after drug treatments and developing treatment 
strategies to eliminate resistant subclones. Furthermore, 
our proposed model can be applied to other medical 
problems, such as tracing bacterial or viral paths of adap- 
tation within the infected host, detailed genome-wide re- 
construction of the epigenetic differentiation program or 
class specification in the hematopoietic system or in other 
systems. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online. 
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